Smart Paper Notes

Communication-Efficient Distributed SGD with Compressed Sensing

Yujie Tang , Vikram Ramanathan , Junshan Zhang , Na Li

Category: Machine Learning

2021-12-15

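We consider large-scale distributed optimization over a set of edge devices connected to a central server, where the limited communication bandwidth between the server and the edge devices imposes a significant bottleneck on the optimization procedure. Inspired by recent advances in federated learning, we propose a distributed stochastic gradient descent (SGD) type algorithm that exploits the sparsity of the gradient, where possible, to reduce the communication burden. At the core of the algorithm, compressed sensing techniques are used to compress the local stochastic gradients on the device side; on the server side, a sparse approximation of the global stochastic gradient is recovered from the noisy aggregated compressed local gradients. We provide a theoretical analysis of the convergence of our algorithm in the presence of noise perturbations introduced by the communication channels, and we also conduct numerical experiments to corroborate its effectiveness.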
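
The compress-then-recover loop described above can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' algorithm: the Gaussian sensing matrix, the dimensions, the noise level, and the choice of orthogonal matching pursuit as the recovery routine are all assumptions, since the abstract does not name a specific recovery method. The function names are hypothetical.

```python
import numpy as np

def compress(A, grad):
    """Device side: project a d-dimensional gradient down to m measurements."""
    return A @ grad

def omp_recover(A, y, k):
    """Server side: greedy k-sparse recovery via orthogonal matching pursuit.

    Assumption: OMP stands in for whatever recovery routine the paper uses.
    """
    support = []
    coef = np.zeros(0)
    residual = y.copy()
    for _ in range(k):
        # Pick the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j in support:  # nothing new to add
            break
        support.append(j)
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

def simulate_round(d=200, m=80, k=5, n_devices=4, channel_noise=0.01, seed=0):
    """One communication round with hypothetical sizes and noise level."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, d)) / np.sqrt(m)  # shared sensing matrix
    support = rng.choice(d, size=k, replace=False)
    local_grads = np.zeros((n_devices, d))
    local_grads[:, support] = rng.standard_normal((n_devices, k)) + 3.0
    # Compression is linear, so averaging the compressed messages equals
    # compressing the averaged gradient.
    y = np.mean([compress(A, g) for g in local_grads], axis=0)
    y = y + channel_noise * rng.standard_normal(m)  # noisy channel
    g_hat = omp_recover(A, y, k)
    return local_grads.mean(axis=0), g_hat, A, y
```

With these hypothetical sizes, each device transmits m = 80 numbers per round instead of d = 200. Calling `simulate_round()` returns the true averaged gradient alongside the server's sparse estimate, so the two can be compared directly.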

Translated by Google Translate

Evaluating Generalizability of Deep Learning Models Using Indian-COVID-19 CT Dataset

Suba S , Nita Parekh , Ramesh Loganathan , Vikram Pudi , Chinnababu Sunkavalli

Category: Computer Vision

2022-12-28

Computed tomography (CT) has been routinely used for the diagnosis of lung diseases and, recently, during the pandemic, for detecting the infectivity and severity of COVID-19. One of the major concerns in using machine learning (ML) approaches for automatic processing of CT scan images in clinical settings is that these methods are trained on limited and biased subsets of publicly available COVID-19 data. This has raised concerns regarding the generalizability of these models to external datasets not seen by the model during training. To address some of these issues, in this work CT scan images from confirmed COVID-19 data obtained from one of the largest public repositories, COVIDx CT 2A, were used for training and internal validation of machine learning models. For external validation, we generated the Indian-COVID-19 CT dataset, an open-source repository containing 3D CT volumes and 12096 chest CT images from 288 COVID-19 patients from India. A comparative performance evaluation of four state-of-the-art machine learning models, viz., a lightweight convolutional neural network (CNN) and three other CNN-based deep learning (DL) models, VGG-16, ResNet-50, and Inception-v3, in classifying CT images into three classes, viz., normal, non-COVID pneumonia, and COVID-19, is carried out on these two datasets. Our analysis showed that the performance of all the models is comparable on the hold-out COVIDx CT 2A test set, with accuracies of 90% to 99% (96% for the CNN), while on the external Indian-COVID-19 CT dataset a drop in performance of 8% to 19% is observed for all the models. The traditional machine learning model, the CNN, performed best on the external dataset (accuracy 88%) in comparison to the deep learning models, indicating that a lightweight CNN generalizes better to unseen data. The data and code are made available at https://github.com/aleesuss/c19.

Higher order organizational features can distinguish protein interaction networks of disease classes: a case study of neoplasms and neurological diseases

Vikram Singh , Vikram Singh

Category: Machine Learning

2022-12-26

Neoplasms (NPs) and neurological diseases and disorders (NDDs) are amongst the major classes of diseases underlying deaths of a disproportionate number of people worldwide. To determine if there exist some distinctive features in the local wiring patterns of protein interactions emerging at the onset of a disease belonging to either of these two classes, we examined 112 and 175 protein interaction networks belonging to NPs and NDDs, respectively. Orbit usage profiles (OUPs) for each of these networks were enumerated by investigating the networks' local topology. 56 non-redundant OUPs (nrOUPs) were derived and used as network features for classification between these two disease classes. Four machine learning classifiers, namely k-nearest neighbour (KNN), support vector machine (SVM), deep neural network (DNN), and random forest (RF), were trained on these data. DNN obtained the greatest average AUPRC (0.988) among these classifiers. DNNs developed on node2vec and the proposed nrOUPs embeddings were compared using 5-fold cross validation on the basis of the average values of six performance measures, viz., AUPRC, Accuracy, Sensitivity, Specificity, Precision and MCC. The nrOUPs-based classifier performed better on all six performance measures.

ReCode: Robustness Evaluation of Code Generation Models

Shiqi Wang , Zheng Li , Haifeng Qian , Chenghao Yang , Zijian Wang , Mingyue Shang , Varun Kumar , Samson Tan , Baishakhi Ray , Parminder Bhatia

Category: Machine Learning | Natural Language Processing

2022-12-20

Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
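
The idea of a worst-case robustness metric grounded in code execution can be sketched as follows. This is a simplified illustration, not the benchmark's exact definition: it assumes one generation per prompt, a boolean pass/fail outcome from running unit tests, and hypothetical function names. A problem counts as robustly solved only if the generated code passes under every perturbed variant of its prompt.

```python
def pass_rate(results):
    """Fraction of problems whose generated code passes its unit tests."""
    return sum(results) / len(results)

def worst_case_pass_rate(perturbed_results):
    """Fraction of problems that pass under ALL perturbed prompts.

    perturbed_results[i][j] is True if problem i passes when the model
    is given the j-th perturbed version of its prompt.
    """
    return sum(all(row) for row in perturbed_results) / len(perturbed_results)

def relative_drop(nominal, worst_case):
    """Relative loss in pass rate when moving from nominal to worst case."""
    return (nominal - worst_case) / nominal

# Hypothetical outcomes for 4 problems, each with 3 perturbed prompts.
nominal = [True, True, False, True]
perturbed = [
    [True, True, True],
    [True, False, True],   # one perturbation breaks this problem
    [False, False, False],
    [True, True, True],
]
```

On this toy data the nominal pass rate is 0.75 and the worst-case pass rate is 0.5, so a single fragile problem removes a third of the model's nominal score.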

CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context

Yangruibo Ding , Zijian Wang , Wasi Uddin Ahmad , Murali Krishna Ramanathan , Ramesh Nallapati , Parminder Bhatia , Dan Roth , Bing Xiang

Category: Natural Language Processing

2022-12-20

While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., cross-file context, a critical source of information that is especially useful in modern modular software development. Such overlooking constrains code language models' capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we develop a cross-file context finder tool, CCFINDER, that effectively locates and retrieves the most relevant cross-file context. We propose CoCoMIC, a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs. CoCoMIC successfully improves the existing code LM with a 19.30% relative increase in exact match and a 15.41% relative increase in identifier matching for code completion when the cross-file context is provided.

Continual Mean Estimation Under User-Level Privacy

Anand Jerry George , Lekshmi Ramesh , Aditya Vikram Singh , Himanshu Tyagi

Category: Machine Learning

2022-12-20

We consider the problem of continually releasing an estimate of the population mean of a stream of samples that is user-level differentially private (DP). At each time instant, a user contributes a sample, and the users can arrive in arbitrary order. Until now, the requirements of continual release and user-level privacy have been considered in isolation. In practice, however, the two requirements arise together, since users often contribute data repeatedly and multiple queries are made. We provide an algorithm that outputs a mean estimate at every time instant $t$ such that the overall release is user-level $\varepsilon$-DP and has the following error guarantee: denoting by $M_t$ the maximum number of samples contributed by a user, as long as $\tilde{\Omega}(1/\varepsilon)$ users have $M_t/2$ samples each, the error at time $t$ is $\tilde{O}(1/\sqrt{t}+\sqrt{M_t}/(t\varepsilon))$. This is a universal error guarantee which is valid for all arrival patterns of the users. Furthermore, it (almost) matches the existing lower bounds for the single-release setting at all time instants when users have contributed an equal number of samples.
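
To get a feel for the error guarantee, it may help to plug in numbers. The worked example below reads the bound as $1/\sqrt{t} + \sqrt{M_t}/(t\varepsilon)$ and ignores the constants and logarithmic factors hidden by the tilde. The values of $t$, $M_t$, and $\varepsilon$, and the labels on the two terms, are illustrative assumptions, not taken from the paper.

```latex
% The two terms of the bound, up to constants and log factors.
\[
\underbrace{\frac{1}{\sqrt{t}}}_{\text{sampling error}}
\;+\;
\underbrace{\frac{\sqrt{M_t}}{t\,\varepsilon}}_{\text{privacy cost}}
\]
% Illustrative values: t = 10^4, M_t = 100, \varepsilon = 1.
\[
\frac{1}{\sqrt{10^{4}}} = 10^{-2},
\qquad
\frac{\sqrt{100}}{10^{4}\cdot 1} = 10^{-3}.
\]
% The second term exceeds the first exactly when
\[
\frac{\sqrt{M_t}}{t\,\varepsilon} > \frac{1}{\sqrt{t}}
\iff
M_t > t\,\varepsilon^{2}.
\]
```

Under this reading, the second term dominates only when a single user has contributed more than $t\varepsilon^2$ of the $t$ samples seen so far.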

Plankton-FL: Exploration of Federated Learning for Privacy-Preserving Training of Deep Neural Networks for Phytoplankton Classification

Daniel Zhang , Vikram Voleti , Alexander Wong , Jason Deglint

Category: Machine Learning | Computer Vision

2022-12-18

Creating high-performance generalizable deep neural networks for phytoplankton monitoring requires utilizing large-scale data coming from diverse global water sources. A major challenge to training such networks lies in data privacy, where data collected at different facilities are often restricted from being transferred to a centralized location. A promising approach to overcome this challenge is federated learning, where training is done at site level on local data, and only the model parameters are exchanged over the network to generate a global model. In this study, we explore the feasibility of leveraging federated learning for privacy-preserving training of deep neural networks for phytoplankton classification. More specifically, we simulate two different federated learning frameworks, federated learning (FL) and mutually exclusive FL (ME-FL), and compare their performance to a traditional centralized learning (CL) framework. Experimental results from this study demonstrate the feasibility and potential of federated learning for phytoplankton monitoring.
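
The train-locally, exchange-parameters-only pattern described above can be sketched in a few lines of plain Python. This is a toy illustration under assumptions: the aggregation rule is the common sample-weighted parameter average (FedAvg style), the model is a one-parameter linear fit rather than a deep network, and the site data and function names are hypothetical. The abstract does not specify the aggregation rule, and the ME-FL variant is not modeled here.

```python
def local_update(w, data, lr=0.05, steps=5):
    """Site level: a few gradient steps on mean squared error for y = w[0] * x."""
    w = list(w)  # copy so the global parameters are not modified in place
    for _ in range(steps):
        grad = sum(2 * (w[0] * x - y) * x for x, y in data) / len(data)
        w[0] -= lr * grad
    return w

def federated_average(site_params, site_sizes):
    """Server: average parameter vectors, weighting each site by its sample count."""
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(p[i] * n for p, n in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]

def run_federated(sites, rounds=20):
    """Alternate local training and server averaging; raw data never leaves a site."""
    global_w = [0.0]
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in sites]
        global_w = federated_average(updates, [len(d) for d in sites])
    return global_w

# Two hypothetical sites whose private data both follow y = 2x.
sites = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
```

Here `run_federated(sites)` converges to a weight close to 2 even though the server only ever sees parameter vectors, never the `(x, y)` pairs held at each site.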

Adaptive ECCM for Mitigating Smart Jammers

Kunal Pattanayak , Shashwat Jain , Vikram Krishnamurthy , Chris Berry

Category: Machine Learning

2022-12-05

This paper considers adaptive radar electronic counter-countermeasures (ECCM) to mitigate electronic countermeasures (ECM) by an adversarial jammer. Our ECCM approach models the jammer-radar interaction as a Principal Agent Problem (PAP), a popular economics framework for interaction between two entities with an information imbalance. In our setup, the radar does not know the jammer's utility. Instead, the radar learns the jammer's utility adaptively over time using inverse reinforcement learning. The radar's adaptive ECCM objective is two-fold: (1) maximize its utility by solving the PAP, and (2) estimate the jammer's utility by observing its response. Our adaptive ECCM scheme uses deep ideas from revealed preference in micro-economics and the principal agent problem in contract theory. Our numerical results show that, over time, our adaptive ECCM both identifies and mitigates the jammer's utility.

Deep Surrogate Docking: Accelerating Automated Drug Discovery with Graph Neural Networks

Ryien Hosseini , Filippo Simini , Austin Clyde , Arvind Ramanathan

Category: Machine Learning

2022-11-04

The process of screening molecules for desirable properties is a key step in several applications, ranging from drug discovery to material design. During the process of drug discovery specifically, protein-ligand docking, or chemical docking, is a standard in-silico scoring technique that estimates the binding affinity of molecules with a specific protein target. Recently, however, as the number of virtual molecules available to test has rapidly grown, these classical docking algorithms have created a significant computational bottleneck. We address this problem by introducing Deep Surrogate Docking (DSD), a framework that applies deep learning-based surrogate modeling to accelerate the docking process substantially. DSD can be interpreted as a formalism of several earlier surrogate prefiltering techniques, adding novel metrics and practical training practices. Specifically, we show that graph neural networks (GNNs) can serve as fast and accurate estimators of classical docking algorithms. Additionally, we introduce FiLMv2, a novel GNN architecture which we show outperforms existing state-of-the-art GNN architectures, attaining more accurate and stable performance by allowing the model to filter out irrelevant information from data more efficiently. Through extensive experimentation and analysis, we show that the DSD workflow combined with the FiLMv2 architecture provides a 9.496x speedup in molecule screening with a <3% recall error rate on an example docking task. Our open-source code is available at https://github.com/ryienh/graph-dock.
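
The surrogate prefiltering workflow can be sketched generically: rank every molecule with a cheap surrogate, run the expensive docking program only on the best-ranked fraction, and measure how many true top hits survive the filter. This is a schematic illustration, not the DSD implementation. The scoring functions, the "lower is better" convention, and this particular notion of recall are assumptions, and all names are hypothetical.

```python
def prefilter_then_dock(molecules, surrogate_score, dock_score, keep_fraction):
    """Dock only the molecules the cheap surrogate ranks best (lower = better)."""
    n_keep = max(1, int(len(molecules) * keep_fraction))
    shortlisted = sorted(molecules, key=surrogate_score)[:n_keep]
    return {m: dock_score(m) for m in shortlisted}

def recall_of_top_hits(molecules, docked, dock_score, n_hits):
    """Fraction of the true top-n_hits molecules that survived prefiltering.

    Evaluation only: computing the true ranking requires docking everything.
    """
    true_top = sorted(molecules, key=dock_score)[:n_hits]
    return sum(m in docked for m in true_top) / n_hits

# Toy library: molecule ids 0..9, where the true docking score equals the id.
molecules = list(range(10))
dock = lambda m: float(m)
# Hypothetical surrogate that badly misjudges molecule 1.
surrogate = lambda m: float(m) + (10.0 if m == 1 else 0.0)

docked = prefilter_then_dock(molecules, surrogate, dock, keep_fraction=0.5)
recall = recall_of_top_hits(molecules, docked, dock, n_hits=3)
```

In this toy run, only 5 of 10 molecules are docked, halving the number of expensive calls, while the surrogate's one misjudgment costs one of the three true top hits, giving a recall of 2/3.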

Robust Causality and False Attribution in Data-Driven Earth Science Discoveries

Elizabeth Eldhose , Tejasvi Chauhan , Vikram Chandel , Subimal Ghosh , Auroop R. Ganguly

Category: Machine Learning (Statistics)

2022-09-26

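Causal and attribution studies are essential for earth science discoveries and critical for informing climate, ecology, and water policies. However, current methods need to keep pace with the complexity of scientific and stakeholder challenges, with data availability, and with the adequacy of data-driven methods. Unless carefully informed by physics, they risk conflating correlation with causation or being overwhelmed by estimation inaccuracies. Given that natural experiments, controlled trials, interventions, and counterfactual examinations are often impractical, information-theoretic methods have been developed and are being continually refined in the earth sciences. Here we show that transfer entropy-based causal graphs, which have recently become popular in the earth sciences through high-profile discoveries, can be spurious even when augmented with statistical significance. We develop a subsample-based ensemble approach for robust causality analysis. Simulated data, along with observations in climate and ecohydrology, demonstrate the robustness and consistency of this approach.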
